Copulas and synthetic data play pivotal roles in statistical modeling, offering innovative solutions for various challenges in Machine Learning. Here, I will focus on the use of copulas for synthetic data generation.
Copulas are mathematical constructs used to model the dependence structure between random variables. Unlike traditional correlation measures, copulas separate the marginal distributions from the dependence structure, providing a more flexible and nuanced approach to capturing complex relationships. They are particularly useful in scenarios where traditional models might fail to capture the intricate dependencies between variables. In the specific case of synthetic data generation, what we need is to mimic the statistical properties of real-world data. But why do we need this “fake” data in the first place? Synthetic data are invaluable in scenarios where obtaining sufficient real data is challenging (small sample size) or when privacy concerns limit access to actual data. By creating synthetic datasets, we can augment the available data, facilitating better model generalization and robustness.
Going back to the statistical properties we mentioned earlier, we are interested in the parameters governing the distribution of each variable separately (the marginals) and in the dependency structure between them (the copula). Once these are known, we can generate new data from the same distribution and with the same correlation.
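To make the idea concrete, here is a minimal, illustrative sketch (not part of the workflow below) of how a Gaussian copula ties arbitrary marginals to a chosen dependence structure; the correlation value 0.8 and the two marginal families are arbitrary assumptions for the example.

import numpy as np
from scipy import stats

rho = 0.8                                           # assumed dependence parameter
cov = [[1.0, rho], [rho, 1.0]]
z = np.random.multivariate_normal([0, 0], cov, size=1000)
u = stats.norm.cdf(z)                               # the copula part: correlated uniforms
x1 = stats.gamma(a=2.0).ppf(u[:, 0])                # arbitrary marginal for variable 1
x2 = stats.norm(loc=10, scale=3).ppf(u[:, 1])       # arbitrary marginal for variable 2
print(np.corrcoef(x1, x2)[0, 1])                    # the dependence carries over to the new marginals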
To give a simple example, let’s take a few variables from the classic Cars Dataset.
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()
df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
            horsepower       weight  acceleration         mpg
count       392.000000   392.000000    392.000000  392.000000
mean        104.469388  2977.584184     15.541327   23.445918
std          38.491160   849.402560      2.758864    7.805007
min          46.000000  1613.000000      8.000000    9.000000
25%          75.000000  2225.250000     13.775000   17.000000
50%          93.500000  2803.500000     15.500000   22.750000
75%         126.000000  3614.750000     17.025000   29.000000
max         230.000000  5140.000000     24.800000   46.600000
Let’s plot the kernel density distribution of 3 variables, the scatter plot of each pair and the corresponding correlation.
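A plot like that could be produced, for instance, with seaborn’s pairplot and pandas’ corr(); this is a minimal sketch rather than the exact code used for the figure, and it assumes the three plotted variables are horsepower, weight and acceleration.

feats = ['horsepower', 'weight', 'acceleration']
sns.pairplot(df[feats], diag_kind='kde')   # KDE on the diagonal, pairwise scatter plots elsewhere
plt.show()
print(df[feats].corr())                    # pairwise Pearson correlations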
We can see all sorts of things here. Aside from the strong correlation among some of the variables, we see that they have different distributions. For example, acceleration looks approximately normally distributed, but the same cannot be said about the other two variables.
Before anything else, let’s try a simple model to predict mpg.
y = df.pop('mpg')
X = df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
0.6501833421053663
Now, can we simulate something so similar to the actual data that we would get the same score? Yes, we can, thanks to copulas! We can generate a synthetic dataset with the same underlying structure.
# Select the best PARAMETRIC univariate (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)

def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D,
    models it using a Gaussian copula, and generates a synthetic dataset S.
    It returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)
    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)
    synthetic = distribs.sample(len(dataset))
    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]
    return X, y, distribs

X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)
Let’s look at the individual distributions fitted by the algorithm.
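One way to inspect them programmatically (a sketch assuming the copulas library’s GaussianMultivariate.to_dict() method; key names can vary across versions) is something like the following. Note that the columns are positional indices here because the copula was fit on a NumPy array.

params = dist.to_dict()
for col, uni in zip(params['columns'], params['univariates']):
    print(col, uni['type'])                               # fitted distribution family per column
print(params.get('correlation', params.get('covariance')))  # dependence structure of the Gaussian copula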
We see the fitted distributions (Gamma and Beta) and their corresponding parameters, such as location and scale. We can also take a look at the correlation matrix that defines the joint distribution.
Now it is time to look at all the synthetic variables and compare them with the original ones. Let’s look at the same things: a summary of the dataset and the plots of the 3 variables.
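One possible way to build that comparison (hypothetical helper code, simply reusing the column names from the original dataset):

synthetic_df = pd.DataFrame(
    np.column_stack([X_synthetic, y_synthetic]),
    columns=['horsepower', 'weight', 'acceleration', 'mpg'],
)
print(synthetic_df.describe())                       # compare with the summary of the real data above
sns.pairplot(synthetic_df[['horsepower', 'weight', 'acceleration']], diag_kind='kde')
plt.show()
print(synthetic_df[['horsepower', 'weight', 'acceleration']].corr())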
The descriptive statistics are remarkably similar, which reflects the statistical properties we emphasized earlier and is exactly what we need. However, a closer look at the individual variable distributions and their correlations reveals some disparities. The kernel density estimates have visibly changed, and although the correlations keep the same sign and order of magnitude, they are not identical.
Now that we have seen similarities and differences, let’s try to run the same simple linear model on the synthetic data.
model = LinearRegression()
model.fit(X_synthetic, y_synthetic)
print(model.score(X_test, y_test))
0.6068245805913557
Looking at the results, they are highly comparable, even though we restricted the candidate marginals to the simplest parametric univariate forms and used only three predictor variables. This implies that our Gaussian copula has effectively captured the statistical characteristics of the dataset that matter for this regression problem.